NSF PAR Search | NSF Public Access Repository


Scalable Processing of Moving Flock Patterns

https://doi.org/10.1145/3748777.3748794

Calderon_Romero, Andres Oswaldo; Tsotras, Vassilis; Bakalov, Petko; Vieira, Marcos R (August 2025, ACM)

We present a scalable approach for identifying moving flock patterns in large trajectory databases. A moving flock pattern refers to a group of entities that move closely together within a defined spatial radius for a minimum time interval. We focus on improving the state-of-the-art sequential algorithms, which suffer from high computational costs when dealing with large datasets. By leveraging distributed frameworks and utilizing spatial partitioning, the proposed solution aims to significantly reduce the time required to detect moving flock patterns. We highlight the bottlenecks of the sequential approaches and offer optimizations like partition-based parallelism and strategies for managing flock patterns that span multiple partitions. An experimental evaluation using synthetic trajectory datasets demonstrates that the proposed methods substantially improve scalability and performance compared to existing sequential algorithms.
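To make the flock definition in this abstract concrete, the following is a minimal sketch, not the paper's algorithm. It only checks whether one given group of entities qualifies as a flock. It assumes that "within a defined spatial radius" means every entity lies within `radius` of the group's centroid at a time step (a simplification chosen here for illustration), and that all trajectories are sampled at the same time steps. The names `longest_flock_run` and `is_flock` are hypothetical.

```python
from math import hypot


def longest_flock_run(trajectories, radius):
    """Longest run of consecutive time steps in which every entity lies
    within `radius` of the group's centroid (assumed closeness test)."""
    # trajectories: non-empty dict mapping entity id -> list of (x, y),
    # one position per time step.
    steps = min(len(track) for track in trajectories.values())
    best = run = 0
    for i in range(steps):
        pts = [track[i] for track in trajectories.values()]
        cx = sum(x for x, _ in pts) / len(pts)
        cy = sum(y for _, y in pts) / len(pts)
        if all(hypot(x - cx, y - cy) <= radius for x, y in pts):
            run += 1
            best = max(best, run)
        else:
            run = 0  # the group broke apart; start counting again
    return best


def is_flock(trajectories, radius, min_duration):
    """True if the group stays together for at least `min_duration` steps."""
    return longest_flock_run(trajectories, radius) >= min_duration


tracks = {
    "a": [(0, 0), (1, 0), (2, 0), (10, 0)],
    "b": [(0, 1), (1, 1), (2, 1), (2, 1)],
}
print(is_flock(tracks, radius=1.0, min_duration=3))
```

Discovering all flocks, rather than verifying one candidate group, is the expensive part that the abstract targets with partitioning and parallelism; this sketch does not attempt it.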
Free, publicly-accessible full text available August 25, 2026
On scalable DCEL overlay operations

https://doi.org/10.1007/s10707-025-00539-x

Calderon-Romero, Andres; Abdelhafeez, Laila; Trajcevski, Goce; Magdy, Amr; Tsotras, Vassilis J (July 2025, GeoInformatica)

The Doubly Connected Edge List (DCEL) is an edge-list structure widely used in spatial applications, primarily for planar topological and geometric computations. However, it is also applicable to various types of data, including 3D models and geographic data. An essential operation is the overlay operation, which combines the DCELs of two input polygon layers and can easily support spatial queries on polygons like the intersection, union, and difference between these layers. However, existing techniques for spatial overlay operations suffer from two main limitations. First, they fail to handle many large datasets practically used in real applications. Second, they cannot handle arbitrary spatial lines that practically form polygons, e.g., city blocks, but they are given as a set of scattered lines. This work proposes a distributed and scalable way to compute the overlay operation and its related supported queries. Our operations also support arbitrary spatial lines through a scalable polygonization process. We address the issues of efficiently distributing the lines and overlay operators and offer various optimizations that improve performance. Our experiments demonstrate that the proposed scalable solution can efficiently compute the overlay of large real datasets.
Free, publicly-accessible full text available July 1, 2026
Optimizing Big Active Data Management Systems

Shirazi, Shahrzad; Wang, Xikui; Carey, Michael; Tsotras, Vassilis (March 2025, Proceedings of the 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2025) co-located with the 28th International Conference on Extending Database Technology and the 28th International Conference on Database Theory (EDBT/ICDT 2025), Barcelona, Spain, March 25, 2025.)

Within the dynamic world of Big Data, traditional systems typically operate in a passive mode, processing and responding to user queries by returning the requested data. However, this methodology falls short of meeting the evolving demands of users who not only wish to analyze data but also to receive proactive updates on topics of interest. To bridge this gap, Big Active Data (BAD) frameworks have been proposed to support extensive data subscriptions and analytics for millions of subscribers. As data volumes and the number of interested users continue to increase, it is imperative to optimize BAD systems for enhanced scalability, performance, and efficiency. To this end, this paper introduces three main optimizations, namely: strategic aggregation, intelligent modifications to the query plan, and early result filtering, all aimed at reinforcing a BAD platform’s capability to actively manage and efficiently process soaring rates of incoming data and distribute notifications to larger numbers of subscribers.
Free, publicly-accessible full text available March 25, 2026
Pyneapple-G: Scalable Spatial Grouping Queries

Abdelhafeez, Laila; Calderon, Andres; Magdy, Amr; Tsotras, Vassilis J (September 2024, Proceedings of the VLDB Endowment)

This paper demonstrates Pyneapple-G, an open-source library for scalable spatial grouping queries based on Apache Sedona (formerly known as GeoSpark). We demonstrate two modules, namely, SGPAC and DDCEL, that support grouping points, grouping lines, and polygon overlays. The SGPAC module provides a large-scale grouping of spatial points by highly complex polygon boundaries. The grouping results aggregate the number of spatial points within the boundaries of each polygon. The DDCEL module provides the first parallelized algorithm to group spatial lines into a DCEL data structure and discovers planar polygons from scattered line segments. Exploiting the scalable DCEL, we support scalable overlay operations over multiple polygon layers to compute the layers’ intersection, union, or difference. To showcase Pyneapple-G, we have developed a frontend web application that enables users to interact with these modules, select their data layers or data points, and view results on an interactive map. We also provide interactive notebooks demonstrating the superiority and simplicity of Pyneapple-G to help social scientists and developers explore its full potential.
Full Text Available
Automating Data Science Pipelines with Tensor Completion

https://doi.org/10.1109/BigData62323.2024.10825934

Pakala, Shaan; Graw, Bryce; Ahn, Dawon; Dinh, Tam; Mahin, Mehnaz Tabassum; Tsotras, Vassilis; Chen, Jia; Papalexakis, Evangelos E (December 2024, IEEE)

Full Text Available
Principled Mining, Forecasting, and Monitoring of Honeybee Time Series with EBV+

https://doi.org/10.1145/3719014

Hossain, Mst Shamima; Faloutsos, Christos; Baer, Boris; Kim, Hyoseung; Tsotras, Vassilis J (June 2025, ACM Transactions on Knowledge Discovery from Data)

Honeybees, as natural crop pollinators, play a significant role in biodiversity and food production for human civilization. Bees actively regulate hive temperature (homeostasis) to maintain a colony’s proper functionality. Deviations from usual thermoregulation behavior due to external stressors (e.g., extreme environmental temperature, parasites, pesticide exposure) indicate an impending colony collapse. Anticipating such threats by forecasting hive temperature and finding changes in temperature patterns would allow beekeepers to take early preventive measures and avoid critical issues. In that case, how can we model bees’ thermoregulation behavior for an interpretable and effective hive monitoring system? In this article, we propose the principled Electronic Bee-Veterinarian Plus (EBV+) method based on the thermal diffusion equation and a novel “sigmoid” feedback-loop (P) controller for analyzing hive health with the following properties: (i) it is effective on multiple, real-world beehive time sequences (recorded and streaming), (ii) it is explainable with only a few parameters (e.g., hive health factor) that beekeepers can easily quantify and trust, (iii) it issues proactive alerts to beekeepers before any potential issue affecting homeostasis becomes detrimental, and (iv) it is scalable with a time complexity of \(O(t)\) for reconstructing and \(O(t\times m)\) for finding \(m\) cuts of a sequence with \(t\) time-ticks. Experimental results on multiple real-world time sequences showcase the potential and practical feasibility of EBV+. Our method yields accurate forecasting (up to 72% improvement in RMSE) with up to 600 times fewer parameters compared to baselines (ARX, seasonal ARX, Holt-Winters, and DeepAR), as well as detects discontinuities and raises alerts that coincide with domain experts’ opinions. Moreover, EBV+ is scalable and fast, taking less than 1 minute on a stock laptop to reconstruct 2 months of sensor data.
Free, publicly-accessible full text available June 30, 2026
Pyneapple-G: Scalable Spatial Grouping Queries

https://doi.org/10.14778/3685800.3685902

Abdelhafeez, Laila; Calderon-Romero, Andres; Magdy, Amr; Tsotras, Vassilis J (August 2024, Proceedings of the VLDB Endowment)

This paper demonstrates Pyneapple-G, an open-source library for scalable spatial grouping queries based on Apache Sedona (formerly known as GeoSpark). We demonstrate two modules, namely, SGPAC and DDCEL, that support grouping points, grouping lines, and polygon overlays. The SGPAC module provides a large-scale grouping of spatial points by highly complex polygon boundaries. The grouping results aggregate the number of spatial points within the boundaries of each polygon. The DDCEL module provides the first parallelized algorithm to group spatial lines into a DCEL data structure and discovers planar polygons from scattered line segments. Exploiting the scalable DCEL, we support scalable overlay operations over multiple polygon layers to compute the layers' intersection, union, or difference. To showcase Pyneapple-G, we have developed a frontend web application that enables users to interact with these modules, select their data layers or data points, and view results on an interactive map. We also provide interactive notebooks demonstrating the superiority and simplicity of Pyneapple-G to help social scientists and developers explore its full potential.
Full Text Available
FUDJ: Flexible User-Defined Distributed Joins

Sevim, Akil; Eldawy, Ahmed; Carman, Preston; Carey, Michael; Tsotras, Vassilis (May 2024, IEEE)

Join operations are crucial in data analysis, but can suffer inefficiency with large datasets and complex non-equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. One popular choice for implementing join algorithms is distributed data processing frameworks, e.g., Hadoop and Spark, but each implementation is highly tailored for specific query types. As a result, they do not address join queries that involve diverse and complex conditions since they are not integrated into a holistic query optimization engine like in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to realize new distributed join algorithms into the database without delving into the database internals. As shown, an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations with an order of magnitude fewer lines of code.
Full Text Available
FUDJ: Flexible User-Defined Distributed Joins

https://doi.org/10.1109/ICDE60146.2024.00320

Sevim, Akil; Eldawy, Ahmed; Carman, E Preston; Carey, Michael J; Tsotras, Vassilis J (May 2024, IEEE)

Join operations are crucial in data analysis, but can suffer inefficiency with large datasets and complex non-equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. One popular choice for implementing join algorithms is distributed data processing frameworks, e.g., Hadoop and Spark, but each implementation is highly tailored for specific query types. As a result, they do not address join queries that involve diverse and complex conditions since they are not integrated into a holistic query optimization engine like in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to realize new distributed join algorithms into the database without delving into the database internals. As shown, an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations with an order of magnitude fewer lines of code.
Full Text Available
SGPAC: Generalized Scalable Spatial GroupBy Aggregations over Complex Polygons

https://doi.org/10.1007/s10707-023-00491-8

Abdelhafeez, Laila; Magdy, Amr; Tsotras, Vassilis J. (October 2023, GeoInformatica)

This paper studies the spatial group-by query over complex polygons. Given a set of spatial points and a set of polygons, the spatial group-by query returns the number of points that lie within the boundaries of each polygon. Groups are selected from a set of non-overlapping complex polygons, typically in the order of thousands, while the input is a large-scale dataset that contains hundreds of millions or even billions of spatial points. This problem is challenging because real polygons (like counties, cities, postal codes, voting regions, etc.) are described by very complex boundaries. We propose a highly-parallelized query processing framework to efficiently compute the spatial group-by query on highly skewed spatial data. We also propose an effective query optimizer that adaptively assigns the appropriate processing scheme based on the query polygons. Our experimental evaluation with real data and queries has shown significant superiority over all existing techniques.
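The query semantics described in this abstract can be illustrated with a small single-machine sketch. It is not the SGPAC framework: it uses a brute-force loop with the standard ray-casting point-in-polygon test, handles only simple polygons without holes, and leaves the result for points exactly on a boundary unspecified. The function names and the sample data are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from `point` toward +x crosses; an odd count means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y coordinate can cross it.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_group_by(points, polygons):
    """Return {polygon name: number of points inside that polygon}."""
    counts = {name: 0 for name in polygons}
    for p in points:
        for name, poly in polygons.items():
            if point_in_polygon(p, poly):
                counts[name] += 1
                break  # polygons are non-overlapping, so stop at first hit
    return counts


polygons = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(3, 0), (5, 0), (5, 2), (3, 2)],
}
points = [(1, 1), (0.5, 1.5), (4, 1), (10, 10)]
print(spatial_group_by(points, polygons))
```

The cost of this loop grows with the number of points times the total number of polygon vertices, which is why complex boundaries and billions of points call for the parallel processing and query optimization the abstract describes.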
Full Text Available
